Before using R to illustrate basic programming concepts and data analysis tools, we will get familiar with the RStudio layout.
RStudio contains four panels
RStudio has four primary panels that will help you interact with your data. We will use the default layout of these panels.
Source panel: Top left
Edit files to create ‘scripts’ of code
Console panel: Bottom left
Accepts code as input
Displays output when we run code
Environment panel: Top right
Everything that R is holding in memory
Objects that you create in the console or source panels will appear here
You can clear the environment with the broom icon
Viewer panel: Bottom right
View graphics that you generate
Navigate files
Illustration
Let’s use these panels to create and interact with data.
Console:
Perform a calculation: type 2 + 2 into the console panel and hit ENTER
Create and store an object: type sum = 2 + 2 into the console panel and hit ENTER
Source:
Start an R script: Open new .R file (button in top-left below “File”)
Create and store an object: type sum = 2 + 3 into the source panel and hit Ctrl+ENTER
Environment:
Confirm that the object sum is stored in our environment
Use rm(sum) to clear the object from the environment
Clear the environment with the broom icon
Viewer:
Navigate through your computer’s files
Create a plot in the source panel
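The steps above can be collected into one short script. This is a minimal sketch rather than the exact code from the session: the plotted values are made up for illustration. Run it line by line from the source panel and watch the console, environment, and viewer panels change.

```r
# Perform a calculation: the result is printed in the console but not stored
2 + 2

# Create and store an object: it should now appear in the Environment panel
sum = 2 + 3
print(sum)

# List the names of the objects currently stored in the environment
ls()

# Remove the object from the environment again
rm(sum)

# Create a simple plot (made-up values); in RStudio it appears in the bottom-right panel
plot(c(1, 2, 3), c(2, 4, 6))
```

Note that naming an object sum does not break R's built-in sum() function; R still finds the function when you call sum(...).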
Review of Basic Programming Concepts
Now that we understand the layout, we are ready to review the concepts covered in Module 2 Week 2.2. These concepts will help us understand what is happening when we create and manipulate data.
Objects: where values are saved in R
“Object” is a generic term for anything that R stores in the environment. This can include anything from an individual number or word, to lists of values, to entire datasets.
Importantly, objects belong to different “classes” depending on the type of values that they store.
Numerics are numbers like 5.6.
Characters are text or strings like "hello world" and "welcome to R".
Factors are a group of characters/strings with a fixed number of unique values.
Logicals are either TRUE or FALSE.
# Create a numeric object
my_number = 5.6
# Check the class
class(my_number)
[1] "numeric"
# Create a character object
my_character = "welcome to R"
# Check the class
class(my_character)
[1] "character"
# Create a logical object
my_logical = FALSE
# Check the class
class(my_logical)
[1] "logical"
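Factors are the one class listed above without an example. A minimal sketch follows, using made-up values ("urban" and "rural") that are not from the survey data:

```r
# Create a factor from a set of strings (illustrative values)
my_factor = factor(c("urban", "rural", "urban"))

# Check the class
class(my_factor)

# List the fixed set of unique values, which R calls "levels"
levels(my_factor)
```

By default R sorts the levels alphabetically, so "rural" comes before "urban" here.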
R can perform operations on objects.
# Create a numeric object
my_number = 5.6
# Check the class
class(my_number)
[1] "numeric"
# Perform a calculation
my_number + 5
[1] 10.6
The class of an object determines the type of operations you can perform on it. Some operations can only be run on numeric objects (numbers).
# Create a character object
my_number = "5.6"
# Check the class
class(my_number)
# Perform a calculation (both lines below fail with an error because my_number is a character, not a number)
my_number + 5
round(my_number)
R contains functions that can convert some objects to different classes.
# Convert character to numeric
my_number = as.numeric("5")
class(my_number)
[1] "numeric"
# But R is only so smart: "five" cannot be read as a number, so the result is NA (with a warning)
my_number = as.numeric("five")
class(my_number)
[1] "numeric"
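The class is still "numeric" because the failed conversion produces a missing value (NA) rather than a number. A short sketch of how to check for this; the object name not_a_number is only for illustration:

```r
# Converting text that is not a number produces NA (the warning is silenced here)
not_a_number = suppressWarnings(as.numeric("five"))
print(not_a_number)

# is.na() returns TRUE when the conversion failed
is.na(not_a_number)

# For comparison, a successful conversion is not NA
is.na(as.numeric("5"))
```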
Data Structures
The simplest objects are single values, but most data analysis involves more complicated data structures.
Lists
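Lists are a type of data structure that stores multiple values together. In R, a list of values that all share the same class is formally called a "vector" (R also has a separate list type, created with list(), which can mix classes). Vectors are created using c() and allow you to perform operations on a series of values.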
# Create a numeric list (also called a "vector")
numeric_vector = c(6, 11, 13, 31)
# Print the vector
print(numeric_vector)
[1] 6 11 13 31
# Check the class
class(numeric_vector)
[1] "numeric"
# Calculate the mean
mean(numeric_vector)
[1] 15.25
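Other operations work on the whole series of values in the same way. A brief sketch using the same vector; the names doubled, total, and n_values are introduced here only for illustration:

```r
# Recreate the vector from above
numeric_vector = c(6, 11, 13, 31)

# Arithmetic is applied to every element at once
doubled = numeric_vector * 2
print(doubled)

# Add up all of the values
total = sum(numeric_vector)
print(total)

# Count how many values the vector holds
n_values = length(numeric_vector)
print(n_values)
```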
An important part of working with more complex data structures is called “indexing.” Indexing allows you to extract specific values from a data structure.
# Extract the 2nd element from the list
numeric_vector[2]
[1] 11
# Extract elements 2-4
numeric_vector[2:4]
[1] 11 13 31
# Extract elements 1-2 using a logical index (TRUE keeps an element, FALSE drops it)
numeric_vector[c(TRUE, TRUE, FALSE, FALSE)]
[1] 6 11
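Logical indexing is most useful when the TRUE/FALSE values come from a condition instead of being typed by hand. A minimal sketch, where the cutoff of 10 is an arbitrary choice for illustration:

```r
# Recreate the vector from above
numeric_vector = c(6, 11, 13, 31)

# A comparison returns one TRUE/FALSE per element
is_large = numeric_vector > 10
print(is_large)

# Use the logical values to keep only the elements greater than 10
large_values = numeric_vector[is_large]
print(large_values)

# A negative index drops an element instead of extracting it
without_first = numeric_vector[-1]
print(without_first)
```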
Dataframes
Data frames are the most common type of data structure used in research. Data frames combine multiple lists of values into a single object.
# Create a dataframe
my_data = data.frame(
  x1 = rnorm(100, mean = 1, sd = 1),
  x2 = rnorm(100, mean = 1, sd = 1)
)
class(my_data)
[1] "data.frame"
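Once a data frame exists, a few base R functions help you inspect it and index into it. The sketch below rebuilds my_data so that it runs on its own; the call to set.seed() is an addition that makes the random numbers repeatable, and the other object names are for illustration only.

```r
# Make the random draws repeatable (optional)
set.seed(1)

# Recreate the data frame from above
my_data = data.frame(
  x1 = rnorm(100, mean = 1, sd = 1),
  x2 = rnorm(100, mean = 1, sd = 1)
)

# Count the rows (observations) and columns (variables)
n_rows = nrow(my_data)
n_cols = ncol(my_data)

# List the column names
column_names = names(my_data)

# Print the first six rows
head(my_data)

# Extract one column with $ and calculate its mean
x1_mean = mean(my_data$x1)

# Extract a single value by row number and column name
first_value = my_data[1, "x1"]
```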
Anything that comes in a spreadsheet (for example, an Excel file) can be loaded into an R environment as a dataframe. R works most easily when spreadsheets are saved as .csv files.
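As a rough sketch of what loading a .csv file looks like, the example below first writes a small made-up table to a temporary file and then reads it back with read.csv(). The data, column names, and file path are all hypothetical; with a real project you would pass the path to your own file instead.

```r
# Build a tiny made-up table and save it as a temporary .csv file
example_data = data.frame(
  id = c(1, 2, 3),
  region = c("Addis Ababa", "Oromia", "Amhara")
)
csv_path = tempfile(fileext = ".csv")
write.csv(example_data, csv_path, row.names = FALSE)

# Load the .csv file into the environment as a dataframe
loaded_data = read.csv(csv_path)

# Check the class and look at the contents
class(loaded_data)
print(loaded_data)
```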
In most data frames, rows correspond to observations and columns correspond to variables that describe the observations. Here, we are looking at survey data from an RCT involving university students in Addis Ababa. Each row corresponds to a different survey respondent, and each column represents their answers to a different question from the survey.
Loading Packages
Packages are an extremely important part of data analysis with R.
R gives you access to thousands of “packages” that are created by users
Packages contain bundles of code called “functions” that can execute specific tasks
Use install.packages() to install a package and library() to load a package
In the next section, we’ll use the package dplyr to perform some data cleaning. dplyr is part of a universe of packages called tidyverse. Since this is one of the most important packages in the R ecosystem, let’s install and load it.
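A minimal sketch of the install-and-load step follows. Installing needs an internet connection and only has to be done once per computer, so that line is commented out here; the check with requireNamespace() and the dplyr_loaded flag are additions for illustration, so that the script does not stop with an error if the package is missing.

```r
# Install the tidyverse collection once per computer (uncomment to run)
# install.packages("tidyverse")

# Load dplyr in every new R session, if it has been installed
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  dplyr_loaded = TRUE
} else {
  dplyr_loaded = FALSE
  message("dplyr is not installed yet; run install.packages(\"tidyverse\") first.")
}
```

In everyday use, most scripts simply start with library(dplyr) at the top.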
Cleaning Data
In the real world, data never comes ready to be analyzed. Data cleaning is the process of manipulating data so that it can be analyzed. This is usually the most difficult and time-consuming part of any data analysis project. Let’s walk through some examples.
Creating Variables
Imagine we want to analyze the relationship between whether a respondent moved to Addis Ababa to attend university and their level of political participation. However, there are two problems:
We don’t have a specific variable that measures whether or not respondents moved
We have many measures of participation
How can we create a variable measuring whether the respondent moved to Addis Ababa? We have a multiple-choice question asking students about what region they come from.
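One possible approach, sketched here in base R with a made-up stand-in for the survey, is to compare the region answer against "Addis Ababa". The data frame survey, the column names region and moved, and the example answers are all hypothetical. The sketch also assumes that any respondent from a region other than Addis Ababa moved there, which may not match how the real variable should be coded.

```r
# A small made-up stand-in for the survey data (hypothetical names and values)
survey = data.frame(
  respondent_id = 1:4,
  region = c("Addis Ababa", "Oromia", "Addis Ababa", "Amhara")
)

# Create a new variable: 0 if the respondent is from Addis Ababa, 1 otherwise
survey$moved = ifelse(survey$region == "Addis Ababa", 0, 1)

# Count how many respondents fall into each group
table(survey$moved)

# Look at the result
print(survey)
```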